knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(shiny))
options(digits = 2)
knitr::knit_hooks$set(inline = function(x) {
  knitr:::format_sci(x, "md")
})

Introduction

We are exploring an already cleaned Gapminder dataset. Where we will extract relevant and influential information from the data to help us get an idea of what is happening within the dataset. We have been provided with particular questions the researcher would like us to look into. A first half of the analysis will be guided, as in the researcher has provided what they want for us to do. The latter portion of the analysis will be done unguided. Questions will be provided, however, how we go about our analysis will be dependent upon us.

DataExplanation

The data was downloaded from Gapminder.org which is a website that takes data from multiple sources and joins them together to give you a combined and centralized time-series data source. The particular data we are looking at comes from (what I believe to be) the GDP per capita in constant PPP dollars data. This data set explores the Gross Domestic Product (GDP) per capita of each country by year.

Data Exploration (Guided)

# load the data into a tibble
data <- read_csv("gapminder_clean.csv")
data <- dplyr::rename(data, "index" = "...1")

To begin we have been asked to look at the year 1962. In particular the researcher would like us to graph and check the correlation between the GDP per cap of each country and the amount of CO2 emissions they expelled between the two columns. We can first subset our data to only include the year 1962. Once we have our subset we can then simply create a scatter plot that includes a linear model line applied to it. This way we can see if the two variables seem to be correlated.

1962 Data

# filter data to Year 1962 only
data.year <- filter(data, Year == 1962)

# scatter plot
ggplot(data.year, aes(`CO2 emissions (metric tons per capita)`, `gdpPercap`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "CO2 emissions vs gdpPercap Scatter Plot w/ lm fit model\nYear filtered to only 1962")

# generate the correlation
cor.year <- cor.test(data.year$`CO2 emissions (metric tons per capita)`, data.year$gdpPercap)

# assign them to variables
p.value <- cor.year$p.value
corr <- cor.year$estimate[[1]]

Now that we see how the two columns look compared against each other for the year of 1962. We can see by the linear model, (represented by a line in the plot), it looks like the two variables have a high correlation between the two of them. Though visually they seem to be highly correlated, one cannot draw conclusions from this plot. Instead we ran a correlation test on the two columns for a quantitative answer to the question. The correlation between the two columns is 0.93 and a \(p\)-value of 1.13× 10-46. This tells us that our data is correlated and is very significant.

Find highest Correlation Data by Year

We are curious to see what year, within our data, has the highest correlation between CO2 emission and GDP per capita. We can do that by grouping the data by Year and then running a correlation test on each group’s CO2 emission by GDP per cap columns. We can turn that into a data.frame and then sort it to grab the highest correlated Year.

# Grab the highest year
high.corr <- data %>%
  group_by(Year) %>%
  summarise(estimate = cor.test(`CO2 emissions (metric tons per capita)`, gdpPercap)$estimate) %>%
  slice_max(order_by = estimate) %>%
  inner_join(data, by = "Year")

highest.year <- high.corr$Year[1]

Once we run this correlation test we find out that the year 1967 has the highest correlation between CO2 emissions and GDP per cap. We are asked to produce an interactive scatter plot for the Year, so that we can better explore the data within this year.

ggplotly(ggplot(high.corr, aes(x = `CO2 emissions (metric tons per capita)`, y = `gdpPercap`)) +
  geom_point(aes(color = continent, size = pop)) +
  labs(title = "CO2 emission vs gdpPercap grouped by continent\nSubset by Year 1967"))

Independent Questions

Since we are now going to analyze these next questions on our own, lets first get some metrics of the total dataset. We have 2607 rows in and 20 columns in our data. There are a total of 14651 NA points within our data. That is 28.1% of the data. This is a sizeable chunk of our data. Which means during our analysis we may have to remove or fill in the NA values.

Question 1

The researcher would like to know if there is a relationship between continent and Energy use (kg of oil equivalent per capita). We know that Energy use (kg of oil equivalent per capita) is a continuous variable and continent is a categorical variable. Let’s first get a sense of our continuous variable, we can plot the histogram of the column and see how it looks.

# Check the distribution of the data
hist(data$`Energy use (kg of oil equivalent per capita)`, main = "Histogram of All Energy Use", xlab = "Energy use (kg of oil equivalent per capita)", breaks = 100)

Looking at the histogram above, we can see the right skew of our continuous dependent variable. The histogram shows us that we have a lot of zero and low values in our data. We can also see that there are quite a few outliers within our data.

Now that we have a sense of the dependent variable. We can look at our independent variable which is categorical. However, if we check to see the amount of NAs in the column we can see that there are a lot of missing values in continent. There are 1323 NA values in the continent column. Though within the Country Name column we have no missing values. Which means that we could download a dataset that associates Country Name with their continent. This would allow us to fill in hopefully a large portion of the data, with minimal effort.

# import country df
continent.df <- read_csv("continents2.csv")
continent.df <- rbind(continent.df, list("Arab World", NA, NA, NA, NA, "Asia", NA, NA, NA, NA, NA))

filled_data <- data %>% left_join(continent.df[, c("name", "region")], by = c("Country Name" = "name"))

Once we bring in the other data frame and use it to fill in a new column called region we get 774 NA values. Which is a huge improvement from before. Let’s graph our Energy use (kg of oil equivalent per capita) column by the newly filled in continent column. While also getting rid of the NA rows within region, so they don’t confuse the analysis.

# Barplot grouped by continent
sub <- data[with(data, !is.na(`Energy use (kg of oil equivalent per capita)`) &
  !is.na(continent)), ]
ggplotly(ggplot(sub, aes(x = continent, y = `Energy use (kg of oil equivalent per capita)`)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Energy use by continent, 1962-2007"))

We can see from the plot above that the data is almost Gaussian looking with Asia being the highest Energy user with the Americas and Europe almost tied for second. We can visually see that there is a large difference between the different continents. Though we should apply an ANOVA test to this data in order to get an actual quantitative value to see if there is a significant difference between groups.

# anova model
aov.model <- aov(`Energy use (kg of oil equivalent per capita)` ~ continent, data = sub)
p.value <- summary(aov.model)[[1]][["Pr(>F)"]][1]

When we run our data through an ANOVA model we can see that we get a significant difference between Energy use by continent. With a \(p\)-value of 8.53× 10-39. However, when we run an ANOVA test, we only have a high-level overview of the data. It may be more informative for us to look and see which comparisons were significant and which group comparisons weren’t. We can run a Tukey HSD for our multiple comparison test.

# tukeyHSD setting margins for the plot
par(mar = c(6, 8, 3, 1))
plot(TukeyHSD(aov.model), las = 1)

Looking at the output from the Tukey HSD model, we can see that the only non-significant comparisons are Oceania-Asia, Oceania-Europe, and Oceania-Americas. All other comparisons are significant. This is interesting as Oceania was the second lowest Energy Use of the five. The significant difference that is observed for Oceania is when it is compared to it’s fellow low Energy user Africa.

Question 2

We have been asked to check the difference between Europe and Asia after the year 1990 in regard to 'Imports of goods and services (% of GDP)'. The first thing we need to do is pull out the data from the year 1990 and after. Then we can sort by continent and make sure to only grab Europe and Asia.

# Subset data frame
after.1990 <- data %>% filter(Year >= 1990 & continent %in% c("Europe", "Asia"))

Once we subset we get 224 rows within our subset. Now that we have our data subset we can visualize the difference between the two continents and run a welche’s two sample t-test. We have two independent continuous groups making the t-test the ideal model to compare the means of these two groups.

# first graph boxplot
ggplotly(ggplot(after.1990, aes(x = continent, y = `Imports of goods and services (% of GDP)`)) +
  geom_boxplot() +
  labs(title = "Boxplot to compare the mean's of Asia and Europe"))
# Then t-test
test.out <- t.test(`Imports of goods and services (% of GDP)` ~ continent, data = after.1990)

We can see from the graph above that the means are extremely close to each other, indicating that most likely our two groups are not significantly different from each other. Though we should quantitatively describe this relationship with the mean’s of the two groups. Asia’s mean is 46.85 and Europe’s mean is 41.79. The \(p\)-value from the t.test is 0.18, which is not significant at all. This confirms what we are seeing in the boxplot.

Question 3

We have been tasked with finding out which of the countries has the highest average 'Population density (people per sq. km of land area)' across all years. We can find that out by creating a subset of the data by year and then grab the countries that have the max amount for that year. Then you can create a barplot to visualize the top countries and get a frequency table to figure out which of the countries on average is high throughout the years.

# Grab the max pops for each year
avg_country <- data %>%
  group_by(Year) %>%
  slice_max(order_by = `Population density (people per sq. km of land area)`)

# plotly graph
ggplotly(ggplot(avg_country, aes(x = Year, y = `Population density (people per sq. km of land area)`, fill = `Country Name`)) +
  geom_bar(stat = "identity") +
  labs(title = "Max Population Density by Year since 1962") +
  theme(plot.title = element_text(hjust = 0.5)))

Looking at the bar graph above, we can see that the top population density per year oscillates between only two options. Macao SAR, China (red) and Monaco (blue). We can see that Macao SAR, China reached a higher population in general. However, if we look at the number of times the two appear, we can see that they both appear five times. Meaning that for the answer to the question which country/countries have the highest average 'Population density (people per sq. km of land area)' we have both Macao SAR, China and Monaco as the two highest average population density across all of the available years in the dataset.

Question 4

Lastly we want to know which of the countries in our dataset has shown the greatest increase in 'Life expectancy at birth, total (years)' since the beginning of our dataset (1962). We can find that out by creating a separate dataset that has only the year 1962 (lower bound of the Year in our data) and 2007 (upper bound of the Year in our data). Once we have this subset we can simply subtract the upper bound minus the lower bound and get our change over time.

# grab the difference between 2007 and 1962 in a dataframe
y <- data %>%
  filter(Year %in% c(1962, 2007) & !is.na(`Life expectancy at birth, total (years)`)) %>%
  group_by(`Country Name`) %>%
  filter(n() == 2) %>%
  mutate(diff_nums = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962]) %>%
  filter(Year == 2007)

# order data
order.total <- y[order(y$diff_nums, decreasing = TRUE), ]
top.five <- order.total[1:5, ]

# Data
ggplotly(ggplot(y, aes(x = reorder(`Country Name`, diff_nums), y = diff_nums)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Country Life Expectancy Increase from 1962-2007",
    x = "Countries", y = "Increase of Life Expectancy"
  ) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  ))
knitr::kable(top.five[, c("Country Name", "diff_nums")])
Country Name diff_nums
Maldives 37
Bhutan 33
Timor-Leste 31
Tunisia 31
Oman 31

After taking the difference between the Life Expectancy of 1962 to 2007 for each country. We then sorted them and pulled out the top five countries. We can see from the table and the graph that Maldives is the highest increasing country in Life Expectancy between the years of 1962 and 2007. What is more interesting from the graph is the countries on the far left that have actually gotten worse in life expectancy.

top.full.five <- data %>% filter(`Country Name` %in% c(
  "Maldives", "Bhutan",
  "Timor-Leste", "Tunisia", "Oman", "Nepal"
))

ggplotly(ggplot(top.full.five, aes(
  x = Year, y = `Life expectancy at birth, total (years)`,
  color = `Country Name`, group = `Country Name`
)) +
  geom_line() +
  geom_point() +
  labs(title = "Life Expectancy Increasing from 1962 to 2007"))

If we graph the top five we can see that the Maldives just barely beats out some of the other top five. However, Maldives has the largest change over the years.

Conclusion

Exploring the Gapminder data we saw that there was a significant correlation between CO2 emissions and gdpPercap in the year 1962. Though we ultimately found out that the year 1967 had the highest correlation between CO2 emissions and gdpPercap. We explored Energy use between continents and showed that there was a significant difference between the Energy use between them. We saw that there was no difference between Asia and Europe and the Imports of goods and services. Over the years we figured out that Monaco and Macao SAR, China went back and forth between having the highest Population density over the years 1962 to 2007. Then finally we concluded that the Maldives has had the highest increase in Life Expectancy between 1962 to 2007 out of the countries present within our dataset.